Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data

In recent years, researchers have proven the effectiveness and speediness of machine learning-based cancer diagnosis models. However, it is difficult to explain the results generated by machine learning models, especially ones that utilized complex high-dimensional data like RNA sequencing data. In this study, we propose the binarilization technique as a novel way to treat RNA sequencing data and used it to construct explainable cancer prediction models. We tested our proposed data processing technique on five different models, namely neural network, random forest, xgboost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Since our datasets are imbalanced, we evaluated the performance of all models using metrics designed for imbalance performance like geometric mean, Matthews correlation coefficient, F-Measure, and area under the receiver operating characteristic curve. Our approach showed comparative performance while relying on less features. Additionally, we demonstrated that data binarilization offers higher explainability by revealing how each feature affects the prediction. These results demonstrate the potential of data binarilization technique in improving the performance and explainability of RNA sequencing based cancer prediction models.


Introduction
Machine learning (ML) models have been used for cancer research for almost 40 years.In the past, researchers primarily focused on using clinical and demographic data to individual's risk of developing cancer [1].Recent advancements in genomic and computational technology has enabled researchers to study cancer more thoroughly and develop new models for cancer prediction and survival analysis [2][3][4][5].
One proven way to study cancer using computational and ML-based methods is by analyzing peptides, specifically anti-cancer peptides (ACPs) data.Because of their low toxicity and greater efficacy, ACPs have recently attracted researchers' interests as a promising therapeutic agent for cancer treatment.However, efficient identification of ACPs is still a challenge.To address this issue, researchers have proposed multiple ML-assisted tools for prediction of ACPs.Some early examples of utilizing ML models such as support vector machine (SVM) to identify ACPs are Chou's pseudo-amino acid composition (PseAAC) and sequence-based identification of anticancer peptides (iACP) [6,7].These methods paved the foundation for approaches that incorporated more advanced ML techniques like feature selection, dimensionality reduction, ensemble learning, and genetic algorithm.To further improve the performance of iACP method, a novel model called iACP-GAEnscC was developed based on ensemble learning and evolutionary genetic algorithm [8].Dimensionality reduction techniques like principal component analysis (PCA) was also used to develop effective models like cACP for prediction of ACPs [9].In particular, the cACP model was further developed into cACP-2LFS model, which utilized a two-level feature selection method to improve performance on existing models, and cACP-DeepGram model, which achieved better performance by incorporating a FastText-based word embedding strategy to represent each peptide sample [10,11].
Another proven way is to utilize patients' RNA sequencing (RNA-seq) data.However, using RNA-seq data for cancer prediction poses a challenge to researchers because of their high dimensionality, complexity, and redundancy, which could lead to decrease in accuracy and efficency [12].To combat this issue, researchers utilized dimensionality reduction and feature selection methods, such as univariate feature selection [13,14], stepwise feature selection [15], PCA [16], autoencoder [17][18][19], and hybird approaches [20].However, both feature selection and dimensionality reduction have some hard-to-fix drawbacks.Because feature selection functions by extracting a subset of features that is more related to the label, it ignores the inter-relationships between features [21].On the other hand, interpreting new data of a lower dimension generated by dimensionality reduction techniques like PCA and autoencoder is difficult because one new feature could correlate to multiple original features [22].
Interpreting a ML model has always been a difficult task.Fundamentally, ML models can be divided into two groups: white box models and black box models.White box models like decision tree and logistic regression are easiler to interpret because they have built-in feature importance that explains how each feature contributes to predictions [23].Black box models, on the other hand, must rely on post-hoc explanation approaches.A few popular post-hoc explanation techniques are Local interpretable model-agnostic explanations (LIME) [24], SHapley Additive exPlanations (SHAP) [25], saliency map, and counterfactual explanations.Although existing peptides and RNA-based models have shown encourging performances in producing accurate predictions, they are limited in terms of interpretability.Therefore, researchers have developed several models to address this issue.For peptides-based models, both white box and post-hoc explanation techniques were used.White box models like ACPred used rule extraction on random forest models to extract decision rules [26].On the other hand, post-hoc explanation like LIME and SHAP are the preferred choice for explaining more complexed models that are based on neural networks or ensemble learning like iAFPs-EnC-GA, AIPs-SnTCN, and ACPred-BMF [27][28][29].For RNA-based models, most studies chose post-hoc explanations like SHAP because of the complexity of the algorithms used [30,31].Despite having many choices, interpreting models that use continuous features is still difficult.For continuous features, most explanation techniques that generate feature importance scores only shows the names of importance features and how each of them contributes to predictions in terms of importance-like scores.This kind of interpretation is not meaningful as users would not understand how each feature, in terms of its original value, contributes to predictions.To combat this issue, researchers proved that, by binarilizing continuous features, interpreting ML models become much easier as each feature is meaningful [32].
Data binarilization has already been proven to be able to increase the interpretability of ML models while maintaining predictive accuracy.Similar approach has been used to identify ACPs and demonstrated its effectiveness [33].However, this technique has never been applied to RNA-seq data before.Therefore, we propose a data binarilization-based approach to increase the performance and interpretability of ML models used for cancer prediction.
The contributions of this study are: • Data binarilization technique is proposed for processing RNA-seq data.
• Multiple models were constructed and tested to examine the effectiveness of the proposed technique.
• Models using proposed technique were compared with models using other state-of-the-art techniques.
• Models using proposed technique showed promising results.
• Proposed technique provides easier-to-understand explanations.
This paper is divided into five sections.Section one consists of reviews on the use of ML models in cancer prediction and ML techniques on interpretability.Section two delineates the methodology of the study, including data collection, data binarilization, feature selection, classification model, hyperparameter search, performance metrics, and model explanaation.Results and analyses are located in section three.Section four discusses the advantages of our proposed technique, contributions and limitations of this study, as well as future research plans.Finally, sections five concludes this paper by providing an overview of this study and future research directions.

Materials and methods
In this section, we present our approach for cancer diagnosis using binarilized RNA-seq data.The approach consists of four parts: data binarilization, feature selection, model construction, and model explanation.First, raw RNA-seq data are binarlized.Then, univariate feature selection is applied to select the most relevant features from binarlized data.Processed data are then randomly splitted into a training set and a test set.The training set is used for both hyperparameter search and model training; whereas the test set is used for measuring the performance of trained models.The train-test process is repeated 10 times to get the average results.Models with the highest F1 scores is used to generate SHAP plots.And finally, SHapley Additive exPlanations is used to interpret trained models and determine most impactful features.

Data collection
All data used in this study came from the National Cancer Institute Genomic Data Commons (GDC).Log 2 (x + 1) normalized illumina Hi-Seq RNA sequencing data was merged with clinical information based on their corresponding sample IDs.Samples without RNA-seq information were removed.Samples with primary, recurrent or metastatic tumor were considered as positive samples, while samples with solid tissue standard samples were considered as negative samples.In this study, only RNA-seq data were used, each of which contains the same 20,530 predictors.The properties of each dataset can be found in Table 1.
All four datasets were normalized using the min-max normalization technique before further processing.All positive samples were labeled as 1, whereas negatives were labeled as 0. After completing all processing, each dataset was divided into a training set and a test set using stratified split.The ratio between a training set and a test set was 80:20.

Data binarilization
Data binarilization can be seen as an extension of data discretization.Data discretization is the process of converting a feature of continuous values into a finite set of intervals, with each interval representing a range of original values.Although data discretization can reduce the complexity of a dataset, it does not make any feature more interpretable as each feature still contains more than two values.Data binarilization, on the other hand, creates a series of binary features for each continuous feature, with each binary feature representing a range of original values.For each set of binary features, each sample shall have only one positive value, representing the range of values the sample belongs to using the original continuous feature.Because of this characteristic, binarilization offers a more direct view than discretization on the relationships between features and outcomes generated by feature selection and model explanation methods.
A data binarilization tool can be constructed using Algorithm 1.

Algorithm 1 Data Binarilization Algorithm
Require: Dataset D, I samples, J features and K binary features function DATA BINARILIZATION(D, K)

Feature selection
Because of the high dimensional nature of RNA-seq data and the additional dimensions created by the binarilization process, feature selection was used to remove irrelevant features to reduce overfitting and increase efficiency [34].In this study, the Chi-Square test (Chi2) was selected.Chi2 is a statistical test primarily used to determine the level of dependency between two categorical variables.For each feature in the feature set, the corresponding χ 2 value is calculated by Eq 1.All features are then ranked in descending order according to their corresponding χ 2 values, with higher χ 2 value indicating higher dependency between the feature and the label.Top-k features with the highest χ 2 values are selected to form the reduced feature set, where k is the number of features chosen by the user.In this study, we chose 20, 200, 2000, 20000 features, representing 0.01%, 0.1%, 1%, and 10% of binarilized features.
In the above formula, m denotes the number of features in the feature set, n denotes the number of distinct labels, O ij denotes the observed frequency of having feature as i and label as j, and E ij denotes the expected frequency of having feature as i and label as j.

Classification model
In order to examine the effectiveness of our proposed data processing technique, we chose to compare the performance of five models based on different algorithms, namely neural network, decision tree, random forest, XGBoost, and support vector machine.
Neural network.Neural Network (NN) essentially is just a stack of interconnected layers of nodes.The typical structure of a NN consists of an input layer, one or more than one hidden layers, and an output layer.Each node in the hidden and output layers is connected to all nodes in the previous layer [19].Each connection has a associated weight.An activation function is also attached to each node in all but the input layer.The activation of an node is determined by the activation function, which calculates the activated value of that node base on all incoming connections.During the training process, the model adjusts all connections' weights to correctly predict the label of the input data.Sometimes, a dropout rate is also attached to each hidden layer to increase the generalizability of the model.Dropout is a process that reduces the overfitting problem by randomly deactivating some nodes in the hidden layers during the training process.Comparing to other ML algorithms, NN has the advantage of being able to detect the non-linear relationships between input data and output labels.However, NN is computationally more expensive than other algorithms, which limits its applicable areas.
Decision tree.Decision tree (DT) is a powerful ML algorithm that is widely used [35].A DT is made of two types of nodes and one-directional links.Each internal (non-leaf) node represents a test on an attribute, each link represents an outcome of the test, and each terminal (leaf) node represents a class label [36].Comparing to other algorithms, it is more interpretable because of the tree-like structure, which can be easily converted to decision rules.The tree strcuture is also ideal for capturing interactions between features [37].
Random forest.The random forest (RF) algorithm is a decision tree-based bagging ensemble algorithm [38].Comparing to other algorithms, RF is more robust against noise in data, has higher scalability, and offers strong performance in high-dimensional settings [39].A random forest classifier generates multiple decision trees.Since not every tree uses all available features or samples, prediction made by each tree is different.Predictions from all trees are then collected and the label that most trees predict will be the final prediction.This process is called the majority voting process.
XGBoost.XGBoost is a decision tree-based boosting ensemble learning algorithm.Comparing to all previous boosting tree algorithms, XGBoost offers higher performance and efficiency [40].Like RF, a XGBoost classifier also generates multiple decision trees.However, unlike bagging ensemble algorithms, boosting ensemble algorithms like XGBoost utilizes a iterative approach to train each sub model, then sequentially combine all sub models to form the final model.The weights of misclassified data in one model will be increased in the next model.Because of this iterative approach, boosting ensemble algorithms are more prone to the overfitting problem.
Support vector machine.Support vector machine is a classic ML algorithm for both linear and nonlinear data.In a classification problem, SVM searches for a decision boundary called hyperplane that classify input samples into one of two classes.To achieve this, SVM utilizes kernels to create a feature space in a higher dimension where linear separation is possible.Therefore, SVM is sensitive to kernel choice.SVM is considered highly accurate and less likely to overfit.However, SVM is less efficient than other algorithms [19].

Hyperparameter search
In order to prevent overfitting and gain maximum performance, hyperparameter search was conducted for each model before training.Because of the lack of knowledge in determining the optimal values for some hyperparameters, random search was used in this study.In this study, each model had 200 variations; the value of each searched hyperparameter of each variation was picked randomly.Based on the F-Measure of each variation, the best performing one was chosen as the predictive model.Due to the small sample sizes of our datasets, 10-fold cross-validation was used in the hyperparameter search processes to avoid over/under-fitting [41].Tables 2 to 6 contains the searched range for each hyperparameter for each algorithm.

Evaluation metrics
After constructing and training the prediction models, we used the test sets to evaluate their performance.Because our datasets were imbalanced, metrics designed for imbalanced classification tasks.Specifically, we chose geometric mean (GMean), Matthews correlation coefficient (MCC), F-Measure (F1), and area under the receiver operating characteristic curve (AUC) as our metrics [42][43][44].We also included accuracy (ACC) as a metric although it is not particularly suitable for imbalanced classification tasks.Accuracy: measures how many predictions are correct.It is calculated by dividing the number of correct predictions from the total number of predictions made, as shown in Eq 2. Sensitivity: measures how well the positive class was predicted by calculating the positive rate, as shown in Eq 3.
Specificity: measures how well the negative class was predicted by calculating the negative rate, as shown in Eq 4. Geometric Mean: is the squared root of the product of the sensitivity and specificity.It is calculated by Eq 5.

Geometric Mean ¼
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Sensitivity * Specificity p Matthews correlation coefficient: calculates the Pearsonproduct − momentcorrelationcoefficient between correct and predicted values [10.1186/s12864-019-6413-7], as shown in Eq 6.

MCC ¼
T p * T n À F p * F n ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Precision: measures the rate of correctly predicted positive samples, as shown in Eq 7.
Recall: is calculated the same way as sensitivity.F-Measure: is the harmonic mean of precision and recall.It is calculated by Eq 8.
Area under the receiver operating characteristic curve: measures the entire area underneath the receiver operating characteristic curve.It can be calculated by Eq 9.
In the above formulas, T p denotes the number of correctly predicted positive samples, T n denotes the number of correctly predicted negative samples, F p denotes the number of incorrectly predicted positive samples, F n and denotes the number of incorrectly predicted negative samples.For AUC, S p denotes the sum of the ranks of all positive samples, whereas n p and n n denote the number of positive and negative samples respectively.

Model explanation
Being a black box algorithm, interpreting a random forest is difficult due to the lack of built-in feature importance.To address this issue, we decided to use a popular post-hoc explainer called SHapley Additive exPlanations (SHAP) to explain the relevance of each feature.SHAP is built on basis of game theory concepts, specifically the shapley value.Shapley values are based on the idea that the outcome of a prediction should determine the importance of each feature involved.SHAP utilizes this idea by constructing multiple models with the same set of hyperparameters and training data, but different sets of features.After models are created, the marginal contribution of each feature is calculated by finding the difference between 1) the difference between the prediction of a model with that feature and the average prediction and 2) the difference between the prediction of a model without that feature and the average prediction.Then, SHAP value for each feature is calculated by taking the average of all marginal contributions of that feature [25].In this study, we chose SHAP because it is able to offer a globally consistent explanation [45].Specifically, we chose to use TreeExplainer that is built to explain ensemble tree models [46].

Environmental setup
Program built for this study ran on machines in the Sun Lab at the Penn State Harrisburg.The machines in the Sun Lab ran on Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz.It provided a RAM of size 128 GB.Our study was implemented using Python 3.10.6.Imbalanced-learn 0.9.1 was used for the implementation of GMean metric.Scikit-learn 1.2.2 was used for implementing Chi-Square feature selection, hyperparameter search, three classifiers including decision

Performance of models
We compared the performance of models using our proposed data processing technique.To examine how feature selection will affect models that use binarilized data, we ran several experiments with different number of binary features selected.The results of all experiments can be found from Tables 7 to 10.By examining Tables 9 and 10, we can see that the performance of DT is slightly worse than other methods, which is to be expected as other methods theoretically should perform better than DT because of their complexity.Overall, the increase in number of features didn't affect the performance of models by a large margin.We also compare the performance of models using our proposed data binarilization technique with other models based on other feature selection or dimensionality reduction techniques.By examining Tables 11 to 14, we can see that models using our proposed technique, despite having to rely on less features because of binarilization, perform about the same as models using other techniques.This proves that not only certain genes, but also some value ranges of certain genes are not relevant to cancer prediction.

Explanation results
To explain the relationship between input features and output labels, SHAP beeswarm plots were generated to visualize such relationships for each model.In a beeswarm plot, the feature names on the left side are ranked from top to bottom base on their corresponding absolute SHAP values.The horizontal axis represents the SHAP value of each data point.In places where multiple data points share the same SHAP value, dots are stacked vertically.The line at the 0 point separates samples that negatively contribute to the prediction from ones that have a positive contribution.We can see that, because our test data was imbalanced, the number of samples on the left side of the 0-line is significantly smaller than the one on the right side.The colorbar on the right side represents the value of the corresponding feature of each data point.
Based on these properties, we can see that data binarilization gives a more direct view on the relationships between each gene and the predicted outcome by examining Figs 2-5, being able to show not only the most impactful features, but also their associated value ranges.Since each binary feature can only be either 0 or 1, SHAP plots for binarlized data only has two colors representing feature values.

Discussion
In this study, we propose a novel data processing technique for cancer-related RNA-seq data.After binarilization, each gene is splitted into ten binary features.For each sample only one of the ten binary features is positive, indicating the range of values the sample's original value of that feature lies between.Because of data binarilization will increase the number of features, we performed feature selection to filter out irrelevant features.We compared the performance of models using binarilized features with models using continuous features.The results show that data binarilization does not affect the predictive performance of models.For model explanation, we used SHAP to rank all features in terms of relevance to prediction.Comparing to other explanation models that use continuous data like iAFPs-EnC-GA, AIPs-SnTCN, and OncoNetExplainer, using binarlized data makes understanding results of SHAP analysis easier because the relevant features along with the relevant value range of the feature are revealed together.
Although we presented a novel approach that shows promising results, there are still some limitations to this study.First, the number of samples in each dataset was quite small.Thus, this work represents a proof-of-concept study for the data binarilization approach.In future, we plan to apply this approach to larger datasets to examine its effectiveness.Second, the datasets were highly imbalanced.Although there are several imbalance treatment options like oversampling and undersampling, we decided not to use them as oversampling could create invalid samples that has more than one binary feature that belong to the same original feature be positive, whereas undersampling will further decrease the number of samples.Therefore, we included metrics designed for imbalance classification tasks to mitigate this issue.Third, because some of the models included in this study were black box models, SHAP was used to provide explanations for them.However, post-hoc explanation techniques like SHAP is prone to biases exist in training data, which could lead to misleading and unfaithful explanations of models.Therefore, researchers suggest that white box models should be developed for fields involve high-stakes decision making [47].Following such suggestions, we plan to utilize our proposed data processing technique to build more interpretable white box models in our future studies.Last but not least, the hidden properties of the datasets, such as the demography of patients and collection time, could have significant impact on the performance of datadriven techniques like ML models.We plan to collaborate with clinicians to test our proposed approach in the real world in our future studies.

Conclusion
Early detection of cancer can increase patients' survival chances [48].Recent developments in technology have enabled the use of new diagnostic methods like ML-assisted models.Because ML models excel at processing complex data, this allows researchers to utilize high-dimensional data like RNA-seq data to predict cancer patients and extract relevant biomarkers.However, ML models suffer from problems like poor interpretability.In this study, we proposed a novel approach that utilize data binarilization to increase interpretability of ML models for cancer diagnosis.We proved that models using binarlized data can achieve the same level of performance while relying on less features.We also showed that data binarilization offers higher interpretability by offering a direct view on how each feature impacts the outcome.In future, we plan to address some limitations mentioned above by proposing new algorithms and establishing collaborations with clinicians.We also plan to apply this approach to address other healthcare problems.

Fig 1
demonstrates the flow of our proposed approach.